In economics and data science, we often work with tabular data, whether it comes from:
Surveys and administrative records
Macroeconomic databases (e.g., World Bank, IMF)
Financial market data feeds
Census and demographic data
Firm-level accounting statements
While NumPy arrays are highly efficient for numerical operations, they lack some of the conveniences required to handle real-world economic data:
Row and column labels (e.g., country names, dates)
Mixed data types (e.g., numeric GDP values with categorical country codes)
Missing values (e.g., unreported statistics for some countries/years)
Hierarchical relationships (e.g., firms nested within industries within countries)
This is where pandas excels. Built on top of NumPy, pandas provides flexible, Pythonic data structures—namely Series (1D) and DataFrames (2D)—that support common operations like data cleaning, subsetting, merging, and reshaping.
The DataFrame in pandas is the closest analog to what economists and social scientists might see in Stata, R, or Excel: rows representing observations, columns representing variables, and the ability to label these axes in intuitive ways. This chapter will guide you through:
Data Structures: Understanding Series and DataFrames
Data Loading: Reading from files and databases
Data Cleaning: Handling missing values and data types
Data Manipulation: Filtering, sorting, and transforming
Data Analysis: Computing statistics and aggregations
Data Export: Saving results in various formats
These skills are crucial for real-world applications in economics and machine learning, where data rarely comes in a perfectly clean and ready-to-use format.
16.2 Pandas Series: The Building Block of DataFrames
Before diving deeper into DataFrames, it’s crucial to understand Series, the fundamental building block of pandas. A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.). In economic data analysis, individual Series often represent:
Time series of a single variable (e.g., monthly GDP values)
Cross-sectional observations (e.g., population across countries)
Categories or classifications (e.g., industry sectors)
Think of a Series as analogous to a single column in Stata or a vector in R, but with added capabilities for handling missing data and maintaining row labels (index). This labeling is particularly valuable when working with economic data where observations often have meaningful identifiers like country names, dates, or industry codes.
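For example, a labeled Series can be indexed directly by its labels (the numbers here are illustrative):
# Population in millions, indexed by country
pop = pd.Series([332, 67, 126], index=['USA', 'UK', 'Japan'])
print(pop['UK'])              # access a single value by label
print(pop[['USA', 'Japan']])  # select several labels at once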
16.2.1 Relationship with NumPy
Since pandas is built on top of NumPy, there’s a close relationship between pandas Series/DataFrames and NumPy arrays:
import numpy as np
import pandas as pd

# Convert NumPy array to Series
arr = np.array([1, 2, 3])
s = pd.Series(arr)

# Convert Series to NumPy array
arr_back = s.to_numpy()

# Convert 2D NumPy array to DataFrame
arr_2d = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
df = pd.DataFrame(
    arr_2d,
    columns=['A', 'B', 'C']
)

# Convert DataFrame to 2D NumPy array
arr_2d_back = df.to_numpy()

# NumPy functions work on Series/DataFrames
# Element-wise square root of Series
print(np.sqrt(s))

# Mean of each DataFrame column
print(np.mean(df))

# Exponential of one column
print(np.exp(df['A']))

# Universal functions (ufuncs) work too
# Element-wise exponential
print(np.exp(s))
While pandas builds on NumPy’s efficient array operations, there are several key differences that make pandas Series more flexible for data analysis. First, Series have an index that labels each position in the array, allowing for intuitive data access using meaningful identifiers rather than just integer positions. This is particularly useful when working with time series data or when rows represent distinct entities like countries or companies. Additionally, Series preserve pandas-specific features that aren’t available in NumPy arrays, such as sophisticated missing value handling.
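A quick sketch of the missing-value difference (illustrative values):
import numpy as np
import pandas as pd

arr = np.array([1.0, np.nan, 3.0])
s = pd.Series(arr)
print(arr.mean())   # nan: NumPy propagates the missing value
print(s.mean())     # 2.0: pandas skips NaN by default (skipna=True)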
16.2.2 Creating Series
You can create a Series from a list, NumPy array, or dictionary:
import pandas as pd
import numpy as np

# From a list
s1 = pd.Series(
    [10, 20, 30, 40]
)
print(s1)
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64

# From a NumPy array
s2 = pd.Series(
    np.array([1, 3, 5, 7])
)
print(s2)
# 0    1
# 1    3
# 2    5
# 3    7
# dtype: int64

# From a dictionary (keys become index)
s3 = pd.Series({'a': 100, 'b': 200, 'c': 300})
print(s3)
# a    100
# b    200
# c    300
# dtype: int64
16.3 Pandas DataFrames
A DataFrame is essentially a collection of Series objects that share the same index. Each column in a DataFrame is a Series. This relationship is crucial for understanding how to extract and manipulate data within a DataFrame.
16.3.1 Creating a DataFrame by combining Series
You can create a DataFrame by combining multiple Series objects. The Series must share the same index for proper alignment:
# Create individual Series
gdp = pd.Series(
    [19.4, 3.0, 4.9],
    index=['USA', 'UK', 'Japan']
)
pop = pd.Series(
    [332, 67, 126],
    index=['USA', 'UK', 'Japan']
)

# Combine into DataFrame
df = pd.DataFrame({
    'GDP_trillion': gdp,
    'Population_million': pop
})
print(df)
#        GDP_trillion  Population_million
# USA            19.4               332.0
# UK              3.0                67.0
# Japan           4.9               126.0
The resulting DataFrame uses the Series’ shared index as row labels and the dictionary keys as column names.
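If the indices do not match exactly, pandas aligns the Series on the union of their labels and fills the gaps with NaN. A small sketch with illustrative values:
gdp2 = pd.Series([19.4, 3.0], index=['USA', 'UK'])
pop2 = pd.Series([67, 126], index=['UK', 'Japan'])
df2 = pd.DataFrame({'GDP_trillion': gdp2, 'Population_million': pop2})
print(df2)
# The rows are the union of the two indices; Japan has no GDP value and
# USA has no population value, so those cells become NaN.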
Later we will see how to import data from files into DataFrames.
16.3.2 Extracting a Series from a DataFrame
You can extract a column (which is a Series) from a DataFrame using bracket notation:
# Assuming df is a DataFrame with a 'GDP' column
gdp_series = df["GDP"]
print(type(gdp_series))
# <class 'pandas.core.series.Series'>
16.3.3 Adding a Series to a DataFrame
You can add a Series as a new column to a DataFrame:
# Create a new Series
new_series = pd.Series(
    [100, 200, 300],
    index=['USA', 'UK', 'Japan']
)

# Add it as a new column
df["New_Column"] = new_series
print(df)
#        GDP  New_Column
# USA    100         100
# UK     200         200
# Japan  300         300
16.4 Understanding Row Identification in pandas
Before diving into specific operations, it’s important to understand how pandas identifies and organizes rows in a DataFrame. This conceptual foundation is crucial for effectively working with data, especially when performing operations like filtering, grouping, or panel data analysis.
16.4.1 Index Types
Default Index
When you create a DataFrame without specifying an index, pandas assigns a default integer index (0, 1, 2, …).
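For example:
df = pd.DataFrame({'GDP': [100, 200, 300]})
print(df)
#    GDP
# 0  100
# 1  200
# 2  300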
Label Index
You can assign meaningful labels to rows using any hashable type (strings, integers, dates).
Common in economics for time series or cross-sectional data.
df.index = ['USA', 'UK', 'Japan']
print(df)
#        GDP
# USA    100
# UK     200
# Japan  300
Multi-Level (Hierarchical) Index
Rows can be identified by multiple levels of labels.
Essential for panel data and nested categorizations.
# Create DataFrame with multi-level index
df = pd.DataFrame({
    'GDP': [100, 120, 200, 220],
    'Year': [2020, 2021, 2020, 2021],
    'Country': ['USA', 'USA', 'UK', 'UK']
})

# Set multi-level index
df_panel = df.set_index(
    ['Country', 'Year']
)
print(df_panel)
#               GDP
# Country Year
# USA     2020  100
#         2021  120
# UK      2020  200
#         2021  220
16.4.2 Working with Indices
Setting and Resetting
# Set a single column as index
df = df.set_index('date')

# Set multiple columns as index levels
df = df.set_index(['country', 'year'])

# Convert index back to regular columns
df = df.reset_index()
Index Operations
# Check index properties
print(df.index.names)   # Names of index levels
print(df.index.levels)  # Values in each level

# Sort by index
df = df.sort_index()
Accessing Data
Different index types enable various data access patterns:
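# Single-level index
gdp_usa = df.loc['USA']

# Multi-level index (df_panel from the example above)
gdp_usa_2020 = df_panel.loc[('USA', 2020)]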
Understanding these index concepts is fundamental to effective data analysis in pandas. The index system determines how you select and filter data, enables powerful grouping and aggregation operations, and provides natural representation for hierarchical data structures. This flexibility is particularly important when working with time series, where dates serve as indices, or when dealing with panel data that requires multiple levels of identification.
The hierarchical indexing capabilities of pandas make it especially powerful for handling complex data structures. Whether you’re analyzing panel data with country-year observations, studying nested relationships across industries and regions over time, or conducting cross-sectional studies with multiple categorical dimensions, the index system provides the foundation for efficient data organization and analysis.
16.5 Loading and Inspecting Datasets
Real-world data often exists in CSV files, Excel spreadsheets, or databases. pandas provides convenient functions to load data from these sources. For example, we load a dataset in CSV format which contains details of athletes who participated in a particular year's Olympic games.
import pandas as pd

# Reading a CSV file
df = pd.read_csv("../data/athletes.csv")

# Peek at the first few rows
print(df.head())

# Get DataFrame shape (rows, columns)
print(df.shape)

# Check column names
print(df.columns)
id name nationality sex dob height weight \
0 736041664 A Jesus Garcia ESP male 10/17/69 1.72 64.0
1 532037425 A Lam Shin KOR female 9/23/86 1.68 56.0
2 435962603 Aaron Brown CAN male 5/27/92 1.98 79.0
3 521041435 Aaron Cook MDA male 1/2/91 1.83 80.0
4 33922579 Aaron Gate NZL male 11/26/90 1.81 71.0
sport gold silver bronze
0 athletics 0 0 0
1 fencing 0 0 0
2 athletics 0 0 1
3 taekwondo 0 0 0
4 cycling 0 0 0
(11538, 11)
Index(['id', 'name', 'nationality', 'sex', 'dob', 'height', 'weight', 'sport',
'gold', 'silver', 'bronze'],
dtype='object')
pd.read_csv(): Reads comma-separated values (CSV). You can customize delimiters (sep='\t' for TSV files), specify headers, handle missing values, and more via additional parameters (see the sketch after this list).
df.head(): Shows the first 5 rows (or another number you specify).
df.shape: Returns a tuple (rows, columns).
df.columns: Lists the column names, which is helpful to quickly see the variables available in the dataset.
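For instance, here is a minimal sketch of customizing the read; the tab-separated file name and the missing-value markers are hypothetical:
# Hypothetical example: reading a tab-separated copy of the athletes data
df_tsv = pd.read_csv(
    "../data/athletes.tsv",   # hypothetical TSV version of the file
    sep="\t",                 # treat tabs as the delimiter
    na_values=["NA", ""],     # strings to interpret as missing values
    parse_dates=["dob"],      # parse this column as dates while reading
)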
You can also load Excel files with pd.read_excel("file.xlsx"), JSON data with pd.read_json("file.json"), and more.
If you’d like to work along with the examples in this chapter, you can download the dataset from https://mlbook.jyotirmoy.net/static/data/athletes.csv. Adjust the path in the read_csv call to the location where you saved the file.
16.5.1 Examining Continuous Columns
Many datasets have both continuous (height, weight) and categorical (e.g., sport, nationality) columns. You can quickly describe numeric and non-numeric data:
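# Summary of numeric columns
print(df.describe())

The output shows summary statistics for the numeric columns:
count: Number of non-null (non-missing) values for each column
mean: Average value
std: Standard deviation
min: Minimum value
25%: First quartile (25th percentile)
50%: Median (50th percentile)
75%: Third quartile (75th percentile)
max: Maximum value
16.5.2 Examining Categorical Columns
# Summary of categorical (non-numeric) columns
print(df.describe(include=["object", "category"]))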
name nationality sex dob sport
count 11538 11538 11538 11537 11538
unique 11517 207 2 5595 28
top Gabriella Szucs USA male 2/18/93 athletics
freq 2 567 6333 9 2363
For categorical data, .describe() shows count, unique category count, top category, and frequency of the top category.
16.6 Data Types (dtypes)
Each column in a DataFrame has an associated data type, which can significantly impact memory usage and performance. Common dtypes include float, int, bool, datetime, and object (for strings or mixed content).
print(df.dtypes)
id int64
name object
nationality object
sex object
dob object
height float64
weight float64
sport object
gold int64
silver int64
bronze int64
dtype: object
This reveals how pandas interprets each column. If a particular column (like a categorical variable or date) is not recognized correctly, you can change its type.
When reading CSV files, string columns are automatically stored as ‘object’ dtype, which is pandas’ way of handling Python strings and mixed data types. While flexible, object dtypes consume more memory because each value is stored as a separate Python object with its own overhead.
For columns with a limited set of unique values (like ‘sex’, ‘nationality’), converting to ‘category’ dtype can significantly reduce memory usage and improve performance:
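# Convert string columns to categorical
df['sex'] = df['sex'].astype('category')
df['nationality'] = df['nationality'].astype('category')
df['sport'] = df['sport'].astype('category')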
The dob column was not correctly recognized as a date. Let's convert it too:
# Convert date columns from object to datetime
# For example, if the 'dob' column contains dates like '01/15/95':
df['dob'] = pd.to_datetime(
    df['dob'],
    format='%m/%d/%y'
)
# Now df['dob'] is datetime64[ns] type instead of object
16.7 Cross-tabulation
Cross-tabulation (or contingency tables) is a method to analyze the relationship between categorical variables by displaying their frequency distributions in a table format. In economics, this is particularly useful for:
Analyzing employment status across education levels
Studying industry distribution across regions
Examining trade relationships between countries
Understanding market segment distributions
pandas provides the pd.crosstab() function for creating these tables:
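# Basic cross-tabulation
pd.crosstab(
    df['nationality'],   # Row variable
    df['sex']            # Column variable
)

# With margins (row/column totals)
pd.crosstab(
    df['nationality'],
    df['sex'],
    margins=True   # Add row/column totals
)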
# Normalize by rows (proportions)
pd.crosstab(
    df['nationality'],
    df['sex'],
    normalize='index'   # Row percentages
)
sex            female      male
nationality
AFG          0.333333  0.666667
ALB          0.500000  0.500000
ALG          0.147059  0.852941
AND          0.600000  0.400000
ANG          0.692308  0.307692
...               ...       ...
VIE          0.565217  0.434783
VIN          0.500000  0.500000
YEM          0.333333  0.666667
ZAM          0.285714  0.714286
ZIM          0.742857  0.257143

207 rows × 2 columns
# Normalize by columns (proportions)
pd.crosstab(
    df['nationality'],
    df['sex'],
    normalize='columns'   # Column percentages
)
# Multiple row variables by
# providing a list of columns
# as the rows
pd.crosstab(
    [df['nationality'], df['sport']],
    df['sex']
)
sex                        female  male
nationality sport
AFG         athletics           1     1
            judo                0     1
ALB         aquatics            1     1
            athletics           1     1
            weightlifting       1     1
...                           ...   ...
ZIM         athletics           1     5
            equestrian          1     0
            football           22     0
            rowing              1     1
            shooting            0     1

1776 rows × 2 columns
We can similarly have multiple columns to create a hierarchical table.
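For example, passing a list of columns as the column argument yields a hierarchical column index; a small sketch using the same athletes data:
# Two column variables: sex and sport become two column levels
pd.crosstab(
    df['nationality'],
    [df['sex'], df['sport']]
)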
The aggfunc parameter in crosstab determines how to combine multiple values that fall into the same category combinations. For example, when analyzing company data by Industry and Region:
sum: Adds up all values (e.g., total revenue for each Industry-Region pair)
mean: Calculates average (e.g., average company size per Industry-Region)
count: Counts occurrences (e.g., number of companies in each Industry-Region)
min/max: Finds extreme values (e.g., largest company in each Industry-Region)
Custom functions can also be used for more complex aggregations.
Suppose that for each country we want the average height of men and women.
# With aggregation function
pd.crosstab(
    df['nationality'],
    df['sex'],
    values=df['height'],   # Values to aggregate
    aggfunc='mean'         # Aggregation function
)
sex            female      male
nationality
AFG          1.650000  1.780000
ALB          1.663333  1.860000
ALG          1.664000  1.772241
AND          1.676667  1.765000
ANG          1.754444  1.780000
...               ...       ...
VIE          1.643846  1.690000
VIN          1.780000  1.730000
YEM          1.660000  1.715000
ZAM          1.760000  1.770000
ZIM          1.643462  1.770000

202 rows × 2 columns
The aggfunc parameter determines how values are combined when multiple rows map to the same Industry-Region combination. For example, if you have multiple companies in the same industry and region, aggfunc='sum' will add their revenues together. Other common options include:
‘mean’: Average value (e.g., average revenue per industry-region)
‘count’: Number of occurrences
‘min’/‘max’: Extreme values
Custom functions can also be used for more complex aggregations
16.8 Computing Aggregates and Summary Statistics
Summaries and group-level statistics are integral to data analysis in economics. With pandas, you can compute aggregates across the entire dataset or grouped by certain columns. These operations are essential for understanding data distributions, identifying patterns, and preparing summary tables for research papers.
16.8.1 Simple Aggregations
# Basic statistics
print("Mean:", df["height"].mean())
print("Median:", df["height"].median())
print("Standard deviation:", df["height"].std())
print("Variance:", df["height"].var())

# Range statistics
print("Min:", df["height"].min())
print("Max:", df["height"].max())
print("First quartile:", df["height"].quantile(0.25))
print()
print("Multiple quartiles:")
print(df["height"].quantile([0.25, 0.5, 0.75]))
print()

# Counts
print("No. of non-missing values: ", df["nationality"].count())
print("No. of unique values: ", df["nationality"].nunique())
Mean: 1.766282119914347
Median: 1.76
Standard deviation: 0.11271869117624804
Variance: 0.012705503340486377
Min: 1.21
Max: 2.21
First quartile: 1.69
Multiple quartiles:
0.25 1.69
0.50 1.76
0.75 1.84
Name: height, dtype: float64
No. of non-missing values: 11538
No. of unique values:  207
You can apply these directly to columns or the entire DataFrame (if all columns are numeric).
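For instance, a small sketch of aggregating several columns at once:
# Column-wise means for a few selected columns
print(df[["height", "weight"]].mean())

# Or restrict to the numeric columns first, then aggregate the whole selection
print(df.select_dtypes("number").mean())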
16.8.2 Grouped Aggregations
The .groupby() function is pandas’ primary tool for grouped aggregations. It lets you:
Split the data into groups
Apply operations to each group independently
Combine the results into a new DataFrame
When you perform a groupby operation, pandas automatically creates an index from your grouping columns, following the index concepts introduced earlier:
# Group by country and compute mean height
# Result is a new Series indexed by nationality
height_by_nat = df.groupby("nationality")["height"].mean()
print(height_by_nat)
nationality
AFG 1.736667
ALB 1.761667
ALG 1.756324
AND 1.712000
ANG 1.762308
...
VIE 1.663913
VIN 1.755000
YEM 1.696667
ZAM 1.766667
ZIM 1.676000
Name: height, Length: 207, dtype: float64
# Multiple aggregations for one column
# Result has country as index, aggregation functions as columns
height_by_nat = df.groupby("nationality")["height"].agg(["mean", "std"])
print(height_by_nat)
mean std
nationality
AFG 1.736667 0.080829
ALB 1.761667 0.157913
ALG 1.756324 0.085935
AND 1.712000 0.054498
ANG 1.762308 0.065685
... ... ...
VIE 1.663913 0.073407
VIN 1.755000 0.035355
YEM 1.696667 0.040415
ZAM 1.766667 0.069186
ZIM 1.676000 0.093971
[207 rows x 2 columns]
You can group by multiple columns to create a hierarchical structure:
# Group by nationality and sport,
# aggregating multiple columns
summaries = df.groupby(
    ["nationality", "sport"]
).agg({
    "height": ["mean", "std"],
    "weight": ["mean", "std"]
})
print(summaries)
height weight
mean std mean std
nationality sport
AFG aquatics NaN NaN NaN NaN
archery NaN NaN NaN NaN
athletics 1.7 0.070711 61.5 9.192388
badminton NaN NaN NaN NaN
basketball NaN NaN NaN NaN
... ... ... ... ...
ZIM tennis NaN NaN NaN NaN
triathlon NaN NaN NaN NaN
volleyball NaN NaN NaN NaN
weightlifting NaN NaN NaN NaN
wrestling NaN NaN NaN NaN
[5796 rows x 4 columns]
When grouping by multiple columns like this, pandas creates a hierarchical index (also called a multi-level index). This structure is particularly valuable in economics because it allows you to analyze data at different levels of granularity.
16.8.3 Working with Multi-Level Indices
Multi-level indices are powerful tools for analyzing hierarchically structured data. They’re particularly valuable in economics for analyzing nested relationships like country-year observations or sector-region combinations. Here’s how to work with them effectively:
Accessing Data: There are several ways to select data from a multi-level index:
# Get all data for one country (returns all sports for USA)
print(summaries.loc["USA"])
# Get data for a specific nation-sport pair using a tuple
print(summaries.loc[("USA", "tennis")])
height  mean     1.820000
        std      0.077460
weight  mean    73.800000
        std      7.420692
Name: (USA, tennis), dtype: float64
# Cross-section: get all football players
# .xs() is especially useful for selecting from inner levels
print(summaries.xs("football", level="sport"))
height weight
mean std mean std
nationality
AFG NaN NaN NaN NaN
ALB NaN NaN NaN NaN
ALG 1.771818 0.089954 69.363636 6.245345
AND NaN NaN NaN NaN
ANG NaN NaN NaN NaN
... ... ... ... ...
VIE NaN NaN NaN NaN
VIN NaN NaN NaN NaN
YEM NaN NaN NaN NaN
ZAM NaN NaN NaN NaN
ZIM 1.639091 0.055370 58.409091 5.670803
[207 rows x 4 columns]
Index Operations: You can reshape your data structure:
# Convert hierarchical index into regular columns
# Useful when you need a flat table format
df_flat = summaries.reset_index()

# Change the order of index levels (e.g., sport-nationality instead of nationality-sport)
df_reordered = summaries.swaplevel()

# Sort the index for faster data access
df_sorted = summaries.sort_index()
Aggregations: Calculate statistics at different hierarchical levels:
# The level= parameter is crucial when grouping by index levels instead of columns
# Without level=, pandas would try to find a column named "sport" and raise a KeyError
# With level="sport", pandas knows to group by the "sport" level of the index
print(summaries.groupby(level="sport").mean())
Grouped operations are essential for subsetting data by categories (e.g., region, industry, time period) and calculating summary statistics (e.g., average GDP, total population). They form the backbone of many empirical economics workflows (think cross-section vs. panel-level computations, or region-by-region analysis).
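16.9 Selecting Rows
16.9.1 Integer Location: iloc
Use .iloc for integer-based indexing:
# Single row
row_5 = df.iloc[5]

# Multiple contiguous rows
first_ten = df.iloc[0:10]

# Non-contiguous row indices
some_rows = df.iloc[[2, 5, 10]]
16.9.2 Label-Based Selection: loc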
When a DataFrame has labeled indices (e.g., custom row labels or dates as an index), .loc is used:
# If rows are labeled by date, for example
row_date = df.loc["2020-01-01"]

# Slicing by label
subset_label = df.loc["2020-01-01":"2020-01-10"]
16.9.3 Boolean Indexing (Filtering)
Boolean indexing is a powerful feature in pandas that lets you filter data using logical conditions. It works by creating a boolean mask (an array of True/False values) that determines which rows to keep.
16.9.3.1 Basic Boolean Indexing
# Create a simple boolean mask
high_income = df["income"] > 50000
print(high_income)   # Shows True/False for each row
# 0     True
# 1    False
# 2     True
# ...

# Use the mask to filter the DataFrame
rich_people = df[high_income]
16.9.3.2 Multiple Conditions
You can combine multiple conditions using logical operators:
& (and)
| (or)
~ (not)
Important: wrap each condition in parentheses and use the operators & and |, not Python's and and or:
# Wrong:
# df[df["income"] > 50000 and
#    df["country"] == "USA"]

# Correct:
rich_americans = df[
    (df["income"] > 50000) &
    (df["country"] == "USA")
]

# Multiple complex conditions
target_group = df[
    (df["income"] > 50000) &     # High income
    (df["age"] < 40) &           # Young
    (
        (df["country"] == "USA") |
        (df["country"] == "Canada")
    )                            # North American
]
16.9.3.3 Common Use Cases
Filtering outliers:
# Remove extreme values
# (e.g., beyond 3 standard deviations)
mean = df["value"].mean()
std = df["value"].std()
clean_df = df[
    (df["value"] >= mean - 3*std) &
    (df["value"] <= mean + 3*std)
]
Time period isolation:
# Data from 2020 onwards
recent_data = df[df["date"] >= "2020-01-01"]
Category-based filtering:
# Large companies in specific sectors
large_tech = df[
    (df["market_cap"] > 1e9) &   # Over 1B
    (df["sector"].isin([
        "Technology",
        "Communications"
    ]))
]
Missing value handling:
# Find rows with complete data for specific columns
complete_cases = df[df[["income", "education", "age"]].notna().all(axis=1)]
Remember that boolean masks can be saved to variables for clarity and reuse, making your code more readable and maintainable.
16.10 Selecting Columns
Columns in a DataFrame can be treated much like keys in a dictionary:
# Single column
gdp_col = df["nationality"]

# Multiple columns
subset_cols = df[["nationality", "sport"]]

# Dot-notation works if the column name has no spaces or special characters
df.nationality
If your column names contain spaces or punctuation, you must stick with bracket notation (e.g., df["GDP (in billions)"]).
16.10.1 Advanced Column Selection
You can select columns based on their data types or using pattern matching:
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

# Select all string (object) columns
string_cols = df.select_dtypes(include=['object'])

# Select multiple types
categorical_or_string = df.select_dtypes(include=['category', 'object'])

# Exclude certain types
non_numeric = df.select_dtypes(exclude=['int64', 'float64'])

# Select columns whose names match a pattern
gdp_cols = df.filter(like='nat')      # Columns containing 'nat'
year_cols = df.filter(regex='^year')  # Columns starting with 'year' (regular expression)
16.10.2 Creating New Columns
You can create new columns from existing ones using arithmetic operations, functions, or more complex transformations:
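# Total medals
df["medals"] = df["gold"] + df["silver"] + df["bronze"]

# Using NumPy functions
df["log_height"] = np.log(df["height"])

# Percentage change from the previous row
# (relevant for a time-series DataFrame with, for example, a GDP column)
# df["GDP_growth"] = df["GDP"].pct_change() * 100

# Size relative to the sample mean
df["relative_size"] = (df["height"] / df["height"].mean()) * 100

# Conditional creation
df["won_something"] = np.where(
    df["medals"] > 0,
    "Yes", "No"
)
16.10.3 Operations on Column Groups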
You can apply operations to groups of columns using column selection methods. The .apply() function is particularly useful here: it allows you to run any function on each element (or column) of your data. Think of it as a more flexible version of a for-loop that works efficiently with pandas data structures.
# Demean all numeric columns
# (subtract mean from each value)
numeric_cols = df.select_dtypes(
    include=['int64', 'float64']
)
df[numeric_cols.columns] = (
    df[numeric_cols.columns]
    - df[numeric_cols.columns].mean()
)

# Scale all numeric columns to [0,1] range
def min_max_scale(x):
    return (
        (x - x.min()) /
        (x.max() - x.min())
    )

# apply() runs min_max_scale on each column
df[numeric_cols.columns] = (
    df[numeric_cols.columns].apply(
        min_max_scale
    )
)

# Log transform all positive numeric cols
positive_cols = numeric_cols.columns[
    df[numeric_cols.columns].gt(0).all()
]
df[positive_cols] = np.log(df[positive_cols])

# Standardize all numeric cols (z-score)
df[numeric_cols.columns] = (
    (df[numeric_cols.columns]
     - df[numeric_cols.columns].mean())
    / df[numeric_cols.columns].std()
)
The .apply() function is versatile:
It can work on rows or columns (specify axis=1 for rows; see the sketch after this list)
It accepts both built-in functions and custom functions you define
It’s often used for complex transformations that can’t be expressed in a single operation
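As a small illustration of row-wise application, the sketch below computes a body-mass index for each athlete; the bmi helper and column name are ours, and it assumes the original height (metres) and weight (kilograms) columns:
# Row-wise apply: combine several columns of the same row
def bmi(row):
    return row["weight"] / (row["height"] ** 2)

df["bmi"] = df.apply(bmi, axis=1)
For simple arithmetic like this, the vectorized form df["weight"] / df["height"]**2 is faster; axis=1 apply earns its keep when the per-row logic is genuinely complex.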
16.11 Dropping rows and columns
Sometimes you need to remove unwanted rows or columns from your DataFrame. pandas provides the drop() function for this purpose:
16.11.1 Dropping Columns
# Drop a single column
df_no_weight = df.drop('weight', axis=1)

# Drop multiple columns
df_cleaned = df.drop(['weight', 'dob'], axis=1)

# Drop columns by position
df_subset = df.drop(df.columns[2:4], axis=1)   # Drop 3rd and 4th columns

# Drop columns matching a pattern
cols_to_drop = df.filter(like='_old').columns  # Columns containing '_old'
df_new = df.drop(cols_to_drop, axis=1)
16.11.2 Dropping Rows
# Drop rows by index label
df_no_usa = df.drop('USA')          # If 'USA' is an index value
df_subset = df.drop(['USA', 'UK'])  # Drop multiple rows

# Drop rows by position
df_no_first = df.drop(0)     # Drop first row
df_subset = df.drop([0, 2])  # Drop first and third rows

# Drop duplicate rows
df_unique = df.drop_duplicates()
df_unique_subset = df.drop_duplicates(subset=['Country'])  # Consider only 'Country' column

# Drop rows based on a condition
df_with_medals = df.drop(df[df['medals'] == 0].index)  # Drop rows with no medals
Remember that drop operations return a new DataFrame by default. To modify the original DataFrame, use inplace=True:
# Modify DataFrame in place
df.drop('weight', axis=1, inplace=True)
16.12 Missing Values
Real-world datasets often have missing entries, represented by NaN (Not a Number). For economists, missing data might arise from incomplete surveys, unavailable country statistics for certain years, or partial reporting by firms.
16.12.1 Finding Missing Values
pandas provides several functions for detecting missing values:
isnull() and isna(): Return a boolean mask indicating which values are missing
notnull() and notna(): Return a boolean mask indicating which values are not missing
sum(): When applied to boolean values, counts the number of True values
These functions are essential for data quality assessment, as missing values can significantly impact statistical analyses and model performance. Here’s how to use them:
# Boolean mask of missing values in column_x
missing_mask = df["column_x"].isnull()   # Returns True for missing values
print(missing_mask)                      # Shows True/False for each row

# Count missing values in each column
missing_counts = df.isnull().sum()       # Counts missing values per column
print(missing_counts)                    # Shows count for each column

# Calculate percentage of missing values
missing_percentages = (df.isnull().sum() / len(df)) * 100
print(missing_percentages)               # Shows % missing for each column
The boolean masks returned by these functions can be used for filtering or as part of more complex data cleaning operations.
16.12.2 Dropping Missing Values
# Drop any rows with at least one missing value
df_dropped_any = df.dropna()

# Drop rows only if a certain column has missing values
df_dropped_subset = df.dropna(subset=["column_x"])
16.12.3 Filling Missing Values
Instead of dropping missing rows, you might choose to fill them:
# Different methods for filling missing values:
df_filled = df["column_x"].fillna(0)              # Replace NaN with a constant (zero)
df_ffill = df["column_x"].fillna(method="ffill")  # Forward fill: copy the last valid value forward
df_bfill = df["column_x"].fillna(method="bfill")  # Backward fill: copy the next valid value backward

# Example with time series data:
# Date    Value
# Jan-01  100
# Jan-02  NaN  -> ffill: 100, bfill: 150
# Jan-03  150
# Jan-04  NaN  -> ffill: 150, bfill: 200
# Jan-05  200

# Fill with computed statistics
df_mean = df["column_x"].fillna(df["column_x"].mean())      # Fill with column mean
df_median = df["column_x"].fillna(df["column_x"].median())  # Fill with column median

# You can also fill with group-specific means
# For example, fill missing GDP values with the mean GDP for that country:
df["GDP"] = df.groupby("country")["GDP"].transform(lambda x: x.fillna(x.mean()))
The appropriate method depends on your specific use case. Forward filling makes sense in time-series data when the last known value is a reasonable estimate for the missing one (like continuing a country’s GDP until new data is available). Backward filling might be preferred when future values are more relevant (like backfilling preliminary statistics with final numbers). Using means or medians is common when the missing values are believed to be random, while group-specific means can account for systematic differences between categories.
16.13 Merging and Joining DataFrames
When analyzing economic data, you often need to combine information from multiple sources. For example:
Computing GDP per capita requires:
A DataFrame with GDP values
Another DataFrame with population data
Analyzing international trade requires:
Trade flow data between countries
Exchange rate data for currency conversion
Country-level economic indicators
The pd.merge() function lets you combine DataFrames based on common columns, like matching country names or dates. Let’s see how this works with a simple example:
# First DataFrame: GDP data
gdp_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'Japan', 'Germany'],
    'GDP': [21.4, 2.7, 5.1, 4.2]
})
print("\nGDP Data:")
print(gdp_data)
#    Country   GDP
# 0      USA  21.4
# 1       UK   2.7
# 2    Japan   5.1
# 3  Germany   4.2

# Second DataFrame: Population data
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France', 'Japan'],
    'Population': [332, 67, 65, 126]
})
print("\nPopulation Data:")
print(population_data)
#   Country  Population
# 0     USA         332
# 1      UK          67
# 2  France          65
# 3   Japan         126

# Merge options demonstrate different ways to handle matches:

# 1. Inner join (default) - only keep countries present in BOTH DataFrames
inner_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='inner'
)
print("\nInner Join Result (only matching countries):")
print(inner_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2   Japan   5.1         126

# 2. Left join - keep ALL countries from gdp_data
left_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='left'
)
print("\nLeft Join Result (all GDP countries):")
print(left_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN

# 3. Right join - keep ALL countries from population_data
right_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='right'
)
print("\nRight Join Result (all Population countries):")
print(right_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2   Japan   5.1         126
# 3  France   NaN          65

# 4. Outer join - keep ALL countries from BOTH DataFrames
outer_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='outer'
)
print("\nOuter Join Result (all countries):")
print(outer_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN
# 4   France   NaN        65.0
The how parameter controls which rows to keep:
'inner': Only keep rows where the Country appears in both DataFrames (intersection)
'left': Keep all countries from the left (first) DataFrame
'right': Keep all countries from the right (second) DataFrame
'outer': Keep all countries from both DataFrames (union)
When a country exists in one DataFrame but not the other, missing values are filled with NaN.
16.13.1 Joining on Different Column Names
Sometimes the columns you want to join on have different names in each DataFrame. For example:
# GDP data uses 'Nation' while population data uses 'Country'
gdp_data = pd.DataFrame({
    'Nation': ['USA', 'UK', 'Japan'],
    'GDP': [21.4, 2.7, 5.1]
})
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France'],
    'Population': [332, 67, 65]
})

# Use left_on and right_on to specify different column names
merged_data = pd.merge(
    gdp_data,
    population_data,
    left_on='Nation',    # Column name in left (first) DataFrame
    right_on='Country',  # Column name in right (second) DataFrame
    how='outer'
)
print(merged_data)
#   Nation   GDP Country  Population
# 0    USA  21.4     USA       332.0
# 1     UK   2.7      UK        67.0
# 2  Japan   5.1     NaN         NaN
# 3    NaN   NaN  France        65.0

# The result contains both original columns
# You can drop the redundant one:
merged_data = merged_data.drop('Country', axis=1)
16.14 Reshaping Data with Pivot Tables
Economic data often comes in different formats depending on the source. Two common formats are:
Long format: Each row represents a single observation, with columns for identifiers (e.g., country, year) and values
Wide format: Each row represents an entity (e.g., country), with separate columns for each time period
Converting between these formats is a common task in economic analysis. Let’s see how to do this with pandas:
# Example: GDP by country and year in long format
gdp_long = pd.DataFrame({
    'Country': ['USA', 'USA', 'UK', 'UK'],
    'Year': [2020, 2021, 2020, 2021],
    'GDP': [20.9, 21.4, 2.6, 2.7]
})
print("\nLong Format:")
print(gdp_long)
#   Country  Year   GDP
# 0     USA  2020  20.9
# 1     USA  2021  21.4
# 2      UK  2020   2.6
# 3      UK  2021   2.7

# Convert to wide format using pivot
gdp_wide = gdp_long.pivot(
    index='Country',   # Rows
    columns='Year',    # Columns
    values='GDP'       # Values to fill
)
print("\nWide Format:")
print(gdp_wide)
# Year     2020  2021
# Country
# UK        2.6   2.7
# USA      20.9  21.4

# More complex pivot with multiple values and aggregations
# Add trade balance data
gdp_trade = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'UK', 'UK', 'UK'],
    'Year': [2020, 2020, 2021, 2020, 2020, 2021],
    'GDP': [20.9, 20.9, 21.4, 2.6, 2.6, 2.7],
    'Trade_Balance': [-0.5, -0.6, -0.7, 0.2, 0.3, 0.1]
})

# Pivot with aggregation functions
gdp_summary = pd.pivot_table(
    gdp_trade,
    index='Country',
    columns='Year',
    values=['GDP', 'Trade_Balance'],
    aggfunc={
        'GDP': 'mean',                    # Average if multiple values
        'Trade_Balance': ['mean', 'sum']  # Multiple aggregations
    }
)
print("\nComplex Pivot Result:")
print(gdp_summary)
#            GDP        Trade_Balance
# Year      2020  2021           mean   sum
# Country
# UK         2.6   2.7           0.25   0.6
# USA       20.9  21.4          -0.55  -1.1
The pivot operations are particularly useful when:
Preparing data for time series analysis (wide format often preferred)
Creating summary tables for research papers
Computing year-over-year changes
Comparing multiple metrics across countries or time periods
16.15 Sorting
Sorting data is a fundamental operation in data analysis. pandas provides flexible methods to sort DataFrames by one or multiple columns, or by index values. This is particularly useful when working with economic time series or when preparing data for presentation.
16.15.1 Basic Sorting
# Sort by a single column
df_sorted = df.sort_values("GDP", ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(["Country", "Year"])

# Sort by index
df_sorted = df.sort_index()
16.15.2 Advanced Sorting Options
You can customize sorting behavior with additional parameters:
# Sort by multiple columns in different directions
df_sorted = df.sort_values(
    ["Country", "GDP"],
    ascending=[True, False]   # Ascending by Country, descending by GDP
)

# Handle missing values (NaN)
df_sorted = df.sort_values(
    "GDP",
    na_position="first"   # Put NaN values at the beginning
)
Remember that sorting operations return a new DataFrame by default. To modify the original DataFrame in place, use the inplace=True parameter.
16.16 Saving DataFrames
After manipulating your data, you’ll often want to save the results for future use or sharing with colleagues. pandas provides several methods to export DataFrames to various file formats, each suited to different needs.
16.16.1 Basic File Formats
For small to medium datasets, standard formats work well:
# Save to CSV (most common)
df.to_csv("output.csv")

# Save to Excel
df.to_excel("output.xlsx", sheet_name="Sheet1")

# Save to JSON
df.to_json("output.json")
16.16.2 Advanced Export Options
You can customize the export process with various parameters:
# CSV with specific encoding and date format
df.to_csv(
    "output.csv",
    encoding="utf-8",
    date_format="%Y-%m-%d",
    index=False   # Don't save index as a column
)

# Excel with multiple sheets
with pd.ExcelWriter("output.xlsx") as writer:
    df1.to_excel(
        writer,
        sheet_name="GDP Data"
    )
    df2.to_excel(
        writer,
        sheet_name="Population Data"
    )

# Compressed CSV for large datasets
df.to_csv("output.csv.gz", compression="gzip")   # gzip compression
df.to_csv("output.csv.bz2", compression="bz2")   # bzip2 compression
df.to_csv("output.csv.zip", compression="zip")   # zip compression

# Parquet format - excellent for large datasets
df.to_parquet(
    "output.parquet",
    compression="snappy"   # Fast compression, good for analytics
)

# Parquet with partitioning (e.g., by year and country)
df.to_parquet(
    "data_directory",
    partition_cols=["year", "country"]   # Creates nested directories
)
Remember to consider file size, readability, and compatibility when choosing an export format. CSV files are widely compatible but may lose data type information, while Excel files preserve formatting but may have size limitations.
16.17 Further Resources
For more advanced features and detailed documentation, see the official pandas documentation at https://pandas.pydata.org/docs/. It provides comprehensive coverage of pandas’ capabilities, including advanced features not covered in this introduction.
16.18 Concluding Remarks
pandas is a cornerstone for data handling tasks in Python, especially when working with mixed-type, row-and-column datasets frequently encountered in economics. By combining the power of pandas with NumPy, you can:
Load and Inspect
Rapidly read data from files or databases and get immediate insights into your dataset’s shape, column names, and data types.
Filter and Subset
Flexibly select rows and columns using integer-based, label-based, or Boolean indexing—making it easy to isolate subsets for detailed analysis or cleaning.
Handle Missing Data
Identify, drop, or fill missing values in a systematic way, crucial for preparing data for econometric models or machine learning pipelines.
Aggregate
Compute descriptive statistics across the entire dataset or by groups. In economics, this is especially valuable for multi-country, multi-industry, or multi-year analyses.
Categorical and Continuous Data
Distinguish between data types for appropriate transformations and quick statistical summaries.
With these skills, you can confidently tackle most routine data management and exploratory tasks in Python. As your projects scale up or require more sophisticated operations, pandas provides additional features—merging, pivoting, time-series functionality, and more—that integrate seamlessly with the broader Python ecosystem for data analysis, econometrics, and machine learning.
---title: "Pandas: Working with Dataframes"---## IntroductionIn economics and data science, we often work with **tabular** data, whether they come from:- Surveys and administrative records- Macroeconomic databases (e.g., World Bank, IMF)- Financial market data feeds- Census and demographic data- Firm-level accounting statementsWhile NumPy arrays are highly efficient for numerical operations, they lack some of the conveniences required to handle real-world economic data:- Row and column labels (e.g., country names, dates)- Mixed data types (e.g., numeric GDP values with categorical country codes)- Missing values (e.g., unreported statistics for some countries/years)- Hierarchical relationships (e.g., firms nested within industries within countries)This is where **pandas** excels. Built on top of NumPy, pandas provides flexible, Pythonic data structures—namely **Series** (1D) and **DataFrames** (2D)—that support common operations like data cleaning, subsetting, merging, and reshaping.The DataFrame in pandas is the closest analog to what economists and social scientists might see in Stata, R, or Excel: rows representing observations, columns representing variables, and the ability to label these axes in intuitive ways. This chapter will guide you through:1. **Data Structures**: Understanding Series and DataFrames2. **Data Loading**: Reading from files and databases3. **Data Cleaning**: Handling missing values and data types4. **Data Manipulation**: Filtering, sorting, and transforming5. **Data Analysis**: Computing statistics and aggregations6. **Data Export**: Saving results in various formatsThese skills are crucial for real-world applications in economics and machine learning, where data rarely comes in a perfectly clean and ready-to-use format.## Pandas Series: The Building Block of DataFramesBefore diving deeper into DataFrames, it's crucial to understand **Series**, the fundamental building block of pandas. A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.). In economic data analysis, individual Series often represent:- Time series of a single variable (e.g., monthly GDP values)- Cross-sectional observations (e.g., population across countries)- Categories or classifications (e.g., industry sectors)Think of a Series as analogous to a single column in Stata or a vector in R, but with added capabilities for handling missing data and maintaining row labels (index). 
This labeling is particularly valuable when working with economic data where observations often have meaningful identifiers like country names, dates, or industry codes.### Relationship with NumPySince pandas is built on top of NumPy, there's a close relationship between pandas Series/DataFrames and NumPy arrays:```pythonimport numpy as npimport pandas as pd# Convert NumPy array to Seriesarr = np.array([1, 2, 3])s = pd.Series(arr)# Convert Series to NumPy arrayarr_back = s.to_numpy()# Convert 2D NumPy array to DataFramearr_2d = np.array([ [1, 2, 3], [4, 5, 6]])df = pd.DataFrame( arr_2d, columns=['A', 'B', 'C'])# Convert DataFrame to 2D NumPy arrayarr_2d_back = df.to_numpy()# NumPy functions work on Series/DataFrames# Element-wise square root of Seriesprint(np.sqrt(s)) # Mean of each DataFrame columnprint(np.mean(df)) # Exponential of one columnprint(np.exp(df['A'])) # Universal functions (ufuncs) work too# Element-wise exponentialprint(np.exp(s)) ```While pandas builds on NumPy's efficient array operations, there are several key differences that make pandas Series more flexible for data analysis. First, Series have an index that labels each position in the array, allowing for intuitive data access using meaningful identifiers rather than just integer positions. This is particularly useful when working with time series data or when rows represent distinct entities like countries or companies.Additionally, Series preserve pandas-specific features that aren't available in NumPy arrays, such as sophisticated missing value handling.### Creating SeriesYou can create a Series from a list, NumPy array, or dictionary:```pythonimport pandas as pdimport numpy as np# From a lists1 = pd.Series( [10, 20, 30, 40])print(s1)# 0 10# 1 20# 2 30# 3 40# dtype: int64# From a NumPy arrays2 = pd.Series( np.array([1, 3, 5, 7]))print(s2)# 0 1# 1 3# 2 5# 3 7# dtype: int64# From a dictionary (keys become index)s3 = pd.Series({'a': 100, 'b': 200, 'c': 300})print(s3)# a 100# b 200# c 300# dtype: int64```## Pandas DataFramesA DataFrame is essentially a collection of Series objects that share the same index. Each column in a DataFrame is a Series. This relationship is crucial for understanding how to extract and manipulate data within a DataFrame.### Creating a DataFrame by combining SeriesYou can create a DataFrame by combining multiple Series objects. 
The Series must share the same index for proper alignment:```python# Create individual Seriesgdp = pd.Series( [19.4, 3.0, 4.9], index=['USA', 'UK', 'Japan'])pop = pd.Series( [332, 67, 126], index=['USA', 'UK', 'Japan'])# Combine into DataFramedf = pd.DataFrame({'GDP_trillion': gdp,'Population_million': pop})print(df)# GDP_trillion Population_million# USA 19.4 332.0# UK 3.0 67.0# Japan 4.9 126.0```The resulting DataFrame uses the Series' shared index as row labels and the dictionary keys as column names.Later we will see how to import data from files into DataFrames.### Extracting a Series from a DataFrameYou can extract a column (which is a Series) from a DataFrame using bracket notation:```python# Assuming df is a DataFramegdp_series = df["GDP"]print(type(gdp_series))# <class 'pandas.core.series.Series'>```### Adding a Series to a DataFrameYou can add a Series as a new column to a DataFrame:```python# Create a new Seriesnew_series = pd.Series( [100, 200, 300], index=['USA', 'UK', 'Japan'])# Add it as a new column df["New_Column"] = new_seriesprint(df)# GDP New_Column# USA 100 100# UK 200 200# Japan 300 300```## Understanding Row Identification in pandasBefore diving into specific operations, it's important to understand how pandas identifies and organizes rows in a DataFrame. This conceptual foundation is crucial for effectively working with data, especially when performing operations like filtering, grouping, or panel data analysis.### Index Types1. **Default Index** - When you create a DataFrame without specifying an index, pandas assigns a default integer index (0, 1, 2, ...).```python df = pd.DataFrame({'GDP': [100, 200, 300]})print(df)# GDP# 0 100# 1 200# 2 300```2. **Label Index** - You can assign meaningful labels to rows using any hashable type (strings, integers, dates). - Common in economics for time series or cross-sectional data.```python df.index = ['USA', 'UK', 'Japan']print(df)# GDP# USA 100# UK 200# Japan 300```3. **Multi-Level (Hierarchical) Index** - Rows can be identified by multiple levels of labels. - Essential for panel data and nested categorizations.```python# Create DataFrame with multi-level index df = pd.DataFrame({'GDP': [100, 120, 200, 220],'Year': [2020, 2021, 2020, 2021],'Country': ['USA', 'USA', 'UK', 'UK'] })# Set multi-level index df_panel = df.set_index( ['Country', 'Year'] )print(df_panel)# GDP# Country Year # USA 2020 100# 2021 120# UK 2020 200# 2021 220```### Working with Indices1. **Setting and Resetting**```python# Set a single column as index df = df.set_index('date')# Set multiple columns as index levels df = df.set_index(['country', 'year'])# Convert index back to regular columns df = df.reset_index()```2. **Index Operations**```python# Check index propertiesprint(df.index.names) # Names of index levelsprint(df.index.levels) # Values in each level# Sort by index df = df.sort_index()```3. **Accessing Data** - Different index types enable various data access patterns:```python# Single-level index gdp_usa = df.loc['USA']# Multi-level index gdp_usa_2020 = df.loc[('USA', 2020)]```Understanding these index concepts is fundamental to effective data analysis in pandas. The index system determines how you select and filter data, enables powerful grouping and aggregation operations, and provides natural representation for hierarchical data structures. 
This flexibility is particularly important when working with time series, where dates serve as indices, or when dealing with panel data that requires multiple levels of identification.The hierarchical indexing capabilities of pandas make it especially powerful for handling complex data structures. Whether you're analyzing panel data with country-year observations, studying nested relationships across industries and regions over time, or conducting cross-sectional studies with multiple categorical dimensions, the index system provides the foundation for efficient data organization and analysis.## Loading and Inspecting DatasetsReal-world data often exist in CSV files, Excel spreadsheets, or databases. pandas provides convenient functions to load data from these sources. For example, we load a dataset in CSV format which contain details of atheletes who participate in a particular year's Olympic games.```{python}#| eval: trueimport pandas as pd# Reading a CSV filedf = pd.read_csv("../data/athletes.csv")# Peek at the first few rowsprint(df.head())# Get DataFrame shape (rows, columns)print(df.shape)# Check column namesprint(df.columns)```- **`pd.read_csv()`**: Reads comma-separated values (CSV). You can customize delimiters (`sep='\t'` for TSV files), specify headers, handle missing values, and more via additional parameters.- **`df.head()`**: Shows the first 5 rows (or another number you specify).- **`df.shape`**: Returns a tuple $(\text{rows},\text{columns})$.- **`df.columns`**: Lists the column names, which is helpful to quickly see the variables available in the dataset.You can also load Excel files with `pd.read_excel("file.xlsx")`, JSON data with `pd.read_json("file.json")`, and more.If you'd like to work along with the examples in this chapter, you can download it from <https://mlbook.jyotirmoy.net/static/data/athletes.csv>. Adjust the path in the `read_csv` call to the path where you saved the file.### Examining Continuous ColumnsMany datasets have both continuous (height, weight) and categorical (e.g., sport, nationality) columns. You can quickly **describe** numeric and non-numeric data:```{python}#| eval: true# Summary of numeric columnsprint(df.describe())```The output shows summary statistics for numeric columns:- count: Number of non-null (non-missing) values for each column- mean: Average value- std: Standard deviation- min: Minimum value- 25%: First quartile (25th percentile)- 50%: Median (50th percentile)- 75%: Third quartile (75th percentile)- max: Maximum value### Examining categorical columns```{python}#|eval: trueprint(df.describe(include=["object", "category"]))```For **categorical** data, `.describe()` shows count, unique category count, top category, and frequency of the top category.## Data Types (`dtypes`)Each column in a DataFrame has an associated data type, which can significantly impact memory usage and performance. Common dtypes include **float**, **int**, **bool**, **datetime**, and **object** (for strings or mixed content).```{python}#| eval: trueprint(df.dtypes)```This reveals how pandas interprets each column. If a particular column (like a categorical variable or date) is not recognized correctly, you can change its type:When reading CSV files, string columns are automatically stored as 'object' dtype, which is pandas' way of handling Python strings and mixed data types. 
While flexible, object dtypes consume more memory because each value is stored as a separate Python object with its own overhead.For columns with a limited set of unique values (like 'sex', 'nationality'), converting to 'category' dtype can significantly reduce memory usage and improve performance:```{python}#| eval: true# Convert string columns to categoricaldf['sex'] = df['sex'].astype('category')df['nationality'] = df['nationality'].astype('category')df['sport'] = df['sport'].astype('category')```The `dob` column was not correctly recognized that as a date. Lets convert that too:```{python}#| eval: true# Convert date columns from object to datetime# For example, if 'dob' column contains dates like '01/15/95':df['dob'] = pd.to_datetime( df['dob'], format='%m/%d/%y')# Now df['dob'] is datetime64[ns] type instead of object```## Cross-tabulationCross-tabulation (or contingency tables) is a method to analyze the relationship between categorical variables by displaying their frequency distributions in a table format. In economics, this is particularly useful for:- Analyzing employment status across education levels- Studying industry distribution across regions- Examining trade relationships between countries- Understanding market segment distributionspandas provides the `pd.crosstab()` function for creating these tables:```{python}#| eval: true# Basic cross-tabulationpd.crosstab( df['nationality'], # Row variable df['sex'] # Column variable)``````{python}#| eval: true# With margins (row/column totals)pd.crosstab( df['nationality'], df['sex'], margins=True# Add row/column totals)``````{python}#| eval: true# Normalize by rows (proportions)pd.crosstab( df['nationality'], df['sex'], normalize='index'# Row percentages)``````{python}# Normalize by columns (proportions)pd.crosstab( df['nationality'], df['sex'], normalize='columns'# Column percentages)``````{python}#| eval: true# Multiple row variables by # providing a list of columns# as the rowspd.crosstab( [df['nationality'],df['sport']], df['sex'] )```We can similarly have multiple columns to create a hierarchical table.The `aggfunc` parameter in crosstab determines how to combine multiple values that fall into the same category combinations. For example, when analyzing company data by Industry and Region:- `sum`: Adds up all values (e.g., total revenue for each Industry-Region pair)- `mean`: Calculates average (e.g., average company size per Industry-Region)- `count`: Counts occurrences (e.g., number of companies in each Industry-Region)- `min/max`: Finds extreme values (e.g., largest company in each Industry-Region)Custom functions can also be used for more complex aggregations.Suppose for each country we want the average height men and women.```{python}#| eval: true# With aggregation functionpd.crosstab( df['nationality'], df['sex'], values=df['height'], # Values to aggregate aggfunc='mean'# Aggregation function)```The `aggfunc` parameter determines how values are combined when multiple rows map to the same Industry-Region combination. For example, if you have multiple companies in the same industry and region, `aggfunc='sum'` will add their revenues together. Other common options include:- 'mean': Average value (e.g., average revenue per industry-region)- 'count': Number of occurrences- 'min'/'max': Extreme values- Custom functions can also be used for more complex aggregations## Computing Aggregates and Summary StatisticsSummaries and group-level statistics are integral to data analysis in economics. 
With pandas, you can compute aggregates across the entire dataset or grouped by certain columns. These operations are essential for understanding data distributions, identifying patterns, and preparing summary tables for research papers.### Simple Aggregations```{python}#| eval: true# Basic statisticsprint("Mean:",df["height"].mean()) print("Median:", df["height"].median()) print("Standard deviation:", df["height"].std())print("Variance:",df["height"].var())# Range statisticsprint("Min:",df["height"].min()) print("Max:",df["height"].max())print("First quartile:",df["height"].quantile(0.25)) print() print("Multiple quartiles:")print(df["height"].quantile([0.25, 0.5, 0.75]))print()# Countsprint("No. of non-missing values: ",df["nationality"].count())print("No. of unique values; ",df["nationality"].nunique()) ```You can apply these directly to columns or the entire DataFrame (if all columns are numeric).### Grouped AggregationsThe `.groupby()` function is pandas' primary tool for grouped aggregations. It lets you:1. Split the data into groups2. Apply operations to each group independently3. Combine the results into a new DataFrameWhen you perform a groupby operation, pandas automatically creates an index from your grouping columns, following the index concepts introduced earlier:```{python}#| eval: true# Group by country and compute mean height# Result is a new DataFrameheight_by_nat = df.groupby("nationality")["height"].mean()print(height_by_nat)``````{python}#| eval: true# Multiple aggregations for one column# Result has country as index, aggregation functions as columnsheight_by_nat = df.groupby("nationality")["height"].agg(["mean", "std"])print(height_by_nat)```You can group by multiple columns to create a hierarchical structure:```{python}#| eval: true# Group by sector and region, # aggregating multiple columnssummaries = df.groupby( ["nationality", "sport"]).agg({"height": ["mean", "std"],"weight": ["mean", "std"]})print(summaries)```When grouping by multiple columns like this, pandas creates a **hierarchical index** (also called a multi-level index). This structure is particularly valuable in economics because it allows you to analyze data at different levels of granularity. ### Working with Multi-Level IndicesMulti-level indices are powerful tools for analyzing hierarchically structured data. They're particularly valuable in economics for analyzing nested relationships like country-year observations or sector-region combinations. Here's how to work with them effectively:1. **Accessing Data**: There are several ways to select data from a multi-level index:```{python}#| eval: True# Get all data for one country (returns all regions for USA)print(summaries.loc["USA"])``````{python}#| eval: True# Get data for a specific nation-sport pair using a tupleprint(summaries.loc[("USA", "tennis")])``````{python}#| eval: true# Cross-section: get all football players# .xs() is especially useful for selecting from inner levelsprint(summaries.xs("football", level="sport"))```2. **Index Operations**: You can reshape your data structure:```python# Convert hierarchical index into regular columns# Useful when you need a flat table formatdf_flat = summaries.reset_index()# Change the order of index levels (e.g., sport-nationality instead of nationality-sport)df_reordered = summaries.swaplevel()# Sort the index for faster data accessdf_sorted = summaries.sort_index()```3. 
3. **Aggregations**: Calculate statistics at different hierarchical levels:

```{python}
#| eval: true
# The level= parameter is crucial when grouping by index levels instead of columns.
# Without level=, pandas would try to find a column named "sport" and raise a KeyError.
# With level="sport", pandas groups by the "sport" level of the index.
print(summaries.groupby(level="sport").mean())
```

Grouped operations are essential for subsetting data by categories (e.g., region, industry, time period) and calculating summary statistics (e.g., average GDP, total population). They form the backbone of many empirical economics workflows (think cross-section vs. panel-level computations, or region-by-region analysis).

## Selecting Rows

### Integer Location: `iloc`

Use `.iloc` for **integer-based** indexing:

```python
# Single row
row_5 = df.iloc[5]

# Multiple contiguous rows
first_ten = df.iloc[0:10]

# Non-contiguous row indices
some_rows = df.iloc[[2, 5, 10]]
```

### Label-Based Selection: `loc`

When a DataFrame has labeled indices (e.g., custom row labels or dates as an index), `.loc` is used:

```python
# If rows are labeled by date, for example
row_date = df.loc["2020-01-01"]

# Slicing by label
subset_label = df.loc["2020-01-01":"2020-01-10"]
```

### Boolean Indexing (Filtering)

Boolean indexing is a powerful feature in pandas that lets you filter data using logical conditions. It works by creating a boolean mask (an array of True/False values) that determines which rows to keep.

#### Basic Boolean Indexing

```python
# Create a simple boolean mask
high_income = df["income"] > 50000
print(high_income)  # Shows True/False for each row
# 0     True
# 1    False
# 2     True
# ...

# Use the mask to filter the DataFrame
rich_people = df[high_income]
```

#### Multiple Conditions

You can combine multiple conditions using logical operators:

- `&` (and)
- `|` (or)
- `~` (not)

Important: wrap each condition in parentheses and use the operators `&` and `|`, not Python's `and` and `or` keywords (a short sketch of negation with `~` follows this example):

```python
# Wrong:
# df[df["income"] > 50000 and
#    df["country"] == "USA"]

# Correct:
rich_americans = df[
    (df["income"] > 50000) &
    (df["country"] == "USA")
]

# Multiple complex conditions
target_group = df[
    (df["income"] > 50000) &  # High income
    (df["age"] < 40) &        # Young
    (
        (df["country"] == "USA") |
        (df["country"] == "Canada")
    )                         # North American
]
```
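The `~` operator inverts a mask. A minimal sketch, reusing the hypothetical `income` and `country` columns from the examples above:

```python
# Negation with ~: everyone who is NOT in the USA
non_usa = df[~(df["country"] == "USA")]

# Saved masks can be inverted the same way
is_american = df["country"] == "USA"
non_usa = df[~is_american]
```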
#### Common Use Cases

1. Filtering outliers:

```python
# Remove extreme values
# (e.g., beyond 3 standard deviations)
mean = df["value"].mean()
std = df["value"].std()
clean_df = df[
    (df["value"] >= mean - 3 * std) &
    (df["value"] <= mean + 3 * std)
]
```

2. Time period isolation:

```python
# Data from 2020 onwards
recent_data = df[df["date"] >= "2020-01-01"]
```

3. Category-based filtering:

```python
# Large companies in specific sectors
large_tech = df[
    (df["market_cap"] > 1e9) &  # Over 1B
    (df["sector"].isin([
        "Technology",
        "Communications"
    ]))
]
```

4. Missing value handling:

```python
# Find rows with complete data for specific columns
complete_cases = df[
    df[["income", "education", "age"]].notna().all(axis=1)
]
```

Remember that boolean masks can be saved to variables for clarity and reuse, making your code more readable and maintainable.

## Selecting Columns

Columns in a DataFrame can be treated much like keys in a dictionary:

```python
# Single column
nat_col = df["nationality"]

# Multiple columns
subset_cols = df[["nationality", "sport"]]

# Dot notation works if the column name has no spaces or special characters
df.nationality
```

If your column names contain spaces or punctuation, you must stick with bracket notation (e.g., `df["GDP (in billions)"]`).

### Advanced Column Selection

You can select columns based on their data types or using pattern matching:

```python
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

# Select all string (object) columns
string_cols = df.select_dtypes(include=['object'])

# Select multiple types
categorical_or_string = df.select_dtypes(include=['category', 'object'])

# Exclude certain types
non_numeric = df.select_dtypes(exclude=['int64', 'float64'])

# Select columns whose names match a pattern
nat_cols = df.filter(like='nat')      # Columns containing 'nat'
year_cols = df.filter(regex='^year')  # Columns starting with 'year' (regular expression)
```

### Creating New Columns

You can create new columns from existing ones using arithmetic operations, functions, or more complex transformations:

```python
# Total medals
df["medals"] = df["gold"] + df["silver"] + df["bronze"]

# Using NumPy functions
df["log_height"] = np.log(df["height"])

# Percentage change from the previous period
df["GDP_growth"] = df["GDP"].pct_change() * 100

# Multiple operations
df["relative_size"] = (df["height"] / df["height"].mean()) * 100

# Conditional creation
df["won_something"] = np.where(
    df["medals"] > 0,
    "Yes", "No"
)
```

### Operations on Column Groups

You can apply operations to groups of columns using column selection methods. The `.apply()` function is particularly useful here - it allows you to run any function on each column (or row) of your data. Think of it as a more flexible version of a for-loop that works efficiently with pandas data structures.

```python
# Demean all numeric columns
# (subtract the column mean from each value)
numeric_cols = df.select_dtypes(
    include=['int64', 'float64']
)
df[numeric_cols.columns] = (
    df[numeric_cols.columns]
    - df[numeric_cols.columns].mean()
)

# Scale all numeric columns to the [0, 1] range
def min_max_scale(x):
    return (
        (x - x.min())
        / (x.max() - x.min())
    )

# apply() runs min_max_scale on each column
df[numeric_cols.columns] = (
    df[numeric_cols.columns].apply(
        min_max_scale
    )
)

# Log-transform all positive numeric columns
positive_cols = numeric_cols.columns[
    df[numeric_cols.columns].gt(0).all()
]
df[positive_cols] = np.log(df[positive_cols])

# Standardize all numeric columns (z-score)
df[numeric_cols.columns] = (
    (df[numeric_cols.columns]
     - df[numeric_cols.columns].mean())
    / df[numeric_cols.columns].std()
)
```

The `.apply()` function is versatile:

- It can work on rows or columns (specify `axis=1` for rows), as sketched below
- It accepts both built-in functions and custom functions you define
- It's often used for complex transformations that can't be expressed in a single operation
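A minimal sketch of a row-wise apply, assuming `height` is recorded in centimetres and `weight` in kilograms (adjust the formula if your data use other units):

```python
# Row-wise apply: compute a body mass index for each athlete
def bmi(row):
    height_m = row["height"] / 100  # assumes centimetres
    return row["weight"] / height_m ** 2

df["bmi"] = df.apply(bmi, axis=1)
```

Row-wise `apply` calls a Python function once per row, so for large datasets a vectorized expression such as `df["weight"] / (df["height"] / 100) ** 2` is usually much faster.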
## Dropping rows and columns

Sometimes you need to remove unwanted rows or columns from your DataFrame. pandas provides the `drop()` method for this purpose:

### Dropping Columns

```python
# Drop a single column
df_no_weight = df.drop('weight', axis=1)

# Drop multiple columns
df_cleaned = df.drop(['weight', 'dob'], axis=1)

# Drop columns by position
df_subset = df.drop(df.columns[2:4], axis=1)  # Drop 3rd and 4th columns

# Drop columns matching a pattern
cols_to_drop = df.filter(like='_old').columns  # Columns containing '_old'
df_new = df.drop(cols_to_drop, axis=1)
```

### Dropping Rows

```python
# Drop rows by index label
df_no_usa = df.drop('USA')          # If 'USA' is an index value
df_subset = df.drop(['USA', 'UK'])  # Drop multiple rows

# Drop rows by position
df_no_first = df.drop(0)     # Drop first row
df_subset = df.drop([0, 2])  # Drop first and third rows

# Drop duplicate rows
df_unique = df.drop_duplicates()
df_unique_subset = df.drop_duplicates(subset=['nationality'])  # Consider only the 'nationality' column

# Drop rows based on a condition
df_medalists = df.drop(df[df['medals'] == 0].index)  # Drop rows with zero medals
```

Remember that drop operations return a new DataFrame by default. To modify the original DataFrame, use `inplace=True`:

```python
# Modify DataFrame in place
df.drop('weight', axis=1, inplace=True)
```
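As an alternative to `axis=1`, `drop()` also accepts the `columns=` and `index=` keyword arguments, which many find more readable; a brief sketch:

```python
# Equivalent keyword-based forms
df_no_weight = df.drop(columns=['weight'])  # drop a column
df_no_first = df.drop(index=[0])            # drop a row by label
```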
## Missing Values

Real-world datasets often have missing entries, represented by NaN (Not a Number). For economists, missing data might arise from incomplete surveys, unavailable country statistics for certain years, or partial reporting by firms.

### Finding Missing Values

pandas provides several functions for detecting missing values:

- `isnull()` and `isna()`: Return a boolean mask indicating which values are missing
- `notnull()` and `notna()`: Return a boolean mask indicating which values are not missing
- `sum()`: When applied to boolean values, counts the number of True values

These functions are essential for data quality assessment, as missing values can significantly impact statistical analyses and model performance. Here's how to use them:

```python
# Boolean mask of missing values in column_x
missing_mask = df["column_x"].isnull()  # True where values are missing
print(missing_mask)  # Shows True/False for each row

# Count missing values in each column
missing_counts = df.isnull().sum()  # Counts missing values per column
print(missing_counts)  # Shows the count for each column

# Calculate the percentage of missing values
missing_percentages = (df.isnull().sum() / len(df)) * 100
print(missing_percentages)  # Shows % missing for each column
```

The boolean masks returned by these functions can be used for filtering or as part of more complex data cleaning operations.

### Dropping Missing Values

```python
# Drop any rows with at least one missing value
df_dropped_any = df.dropna()

# Drop rows only if a certain column has missing values
df_dropped_subset = df.dropna(subset=["column_x"])
```

### Filling Missing Values

Instead of dropping missing rows, you might choose to **fill** them:

```python
# Different methods for filling missing values:
df_filled = df["column_x"].fillna(0)  # Replace NaN with a constant (zero)
df_ffill = df["column_x"].ffill()     # Forward fill: copy the last valid value forward
df_bfill = df["column_x"].bfill()     # Backward fill: copy the next valid value backward

# Example with time series data:
# Date    Value
# Jan-01  100
# Jan-02  NaN   -> ffill: 100, bfill: 150
# Jan-03  150
# Jan-04  NaN   -> ffill: 150, bfill: 200
# Jan-05  200

# Fill with computed statistics
df_mean = df["column_x"].fillna(df["column_x"].mean())      # Fill with the column mean
df_median = df["column_x"].fillna(df["column_x"].median())  # Fill with the column median

# You can also fill with group-specific means.
# For example, fill missing GDP values with the mean GDP for that country:
df["GDP"] = df.groupby("country")["GDP"].transform(lambda x: x.fillna(x.mean()))
```

The appropriate method depends on your specific use case. Forward filling makes sense in time-series data when the last known value is a reasonable estimate for the missing one (like carrying a country's GDP forward until new data become available). Backward filling might be preferred when future values are more relevant (like backfilling preliminary statistics with final numbers). Using means or medians is common when the missing values are believed to be random, while group-specific means can account for systematic differences between categories.

## Merging and Joining DataFrames

When analyzing economic data, you often need to combine information from multiple sources. For example:

1. Computing GDP per capita requires:
   - A DataFrame with GDP values
   - Another DataFrame with population data
2. Analyzing international trade requires:
   - Trade flow data between countries
   - Exchange rate data for currency conversion
   - Country-level economic indicators

The `pd.merge()` function lets you combine DataFrames based on common columns, like matching country names or dates. Let's see how this works with a simple example:
```python
# First DataFrame: GDP data
gdp_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'Japan', 'Germany'],
    'GDP': [21.4, 2.7, 5.1, 4.2]
})
print("\nGDP Data:")
print(gdp_data)
#    Country   GDP
# 0      USA  21.4
# 1       UK   2.7
# 2    Japan   5.1
# 3  Germany   4.2

# Second DataFrame: Population data
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France', 'Japan'],
    'Population': [332, 67, 65, 126]
})
print("\nPopulation Data:")
print(population_data)
#   Country  Population
# 0     USA         332
# 1      UK          67
# 2  France          65
# 3   Japan         126

# Merge options demonstrate different ways to handle matches:

# 1. Inner join (default) - only keep countries present in BOTH DataFrames
inner_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='inner'
)
print("\nInner Join Result (only matching countries):")
print(inner_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2   Japan   5.1         126

# 2. Left join - keep ALL countries from gdp_data
left_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='left'
)
print("\nLeft Join Result (all GDP countries):")
print(left_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN

# 3. Right join - keep ALL countries from population_data
right_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='right'
)
print("\nRight Join Result (all Population countries):")
print(right_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2  France   NaN          65
# 3   Japan   5.1         126

# 4. Outer join - keep ALL countries from BOTH DataFrames
outer_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='outer'
)
print("\nOuter Join Result (all countries):")
print(outer_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN
# 4   France   NaN        65.0
```

The `how` parameter controls which rows to keep:

- `'inner'`: Only keep rows where the Country appears in both DataFrames (intersection)
- `'left'`: Keep all countries from the left (first) DataFrame
- `'right'`: Keep all countries from the right (second) DataFrame
- `'outer'`: Keep all countries from both DataFrames (union)

When a country exists in one DataFrame but not the other, missing values are filled with `NaN`.

### Joining on Different Column Names

Sometimes the columns you want to join on have different names in each DataFrame. For example:

```python
# GDP data uses 'Nation' while population data uses 'Country'
gdp_data = pd.DataFrame({
    'Nation': ['USA', 'UK', 'Japan'],
    'GDP': [21.4, 2.7, 5.1]
})
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France'],
    'Population': [332, 67, 65]
})

# Use left_on and right_on to specify different column names
merged_data = pd.merge(
    gdp_data,
    population_data,
    left_on='Nation',    # Column name in left (first) DataFrame
    right_on='Country',  # Column name in right (second) DataFrame
    how='outer'
)
print(merged_data)
#   Nation   GDP Country  Population
# 0    USA  21.4     USA       332.0
# 1     UK   2.7      UK        67.0
# 2  Japan   5.1     NaN         NaN
# 3    NaN   NaN  France        65.0

# The result contains both original columns.
# You can drop the redundant one:
merged_data = merged_data.drop('Country', axis=1)
```
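Returning to the motivating example at the start of this section, a minimal sketch of computing GDP per capita from the `inner_joined` table above, assuming GDP is measured in trillions of USD and population in millions:

```python
# Trillions / millions = millions of USD per person;
# multiply by 1000 to express the result in thousands of USD
gdp_pc = inner_joined.copy()
gdp_pc["GDP_per_capita"] = gdp_pc["GDP"] / gdp_pc["Population"] * 1000
print(gdp_pc)
```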
## Reshaping Data with Pivot Tables

Economic data often comes in different formats depending on the source. Two common formats are:

- **Long format**: Each row represents a single observation, with columns for identifiers (e.g., country, year) and values
- **Wide format**: Each row represents an entity (e.g., country), with separate columns for each time period

Converting between these formats is a common task in economic analysis. Let's see how to do this with pandas:

```python
# Example: GDP by country and year in long format
gdp_long = pd.DataFrame({
    'Country': ['USA', 'USA', 'UK', 'UK'],
    'Year': [2020, 2021, 2020, 2021],
    'GDP': [20.9, 21.4, 2.6, 2.7]
})
print("\nLong Format:")
print(gdp_long)
#   Country  Year   GDP
# 0     USA  2020  20.9
# 1     USA  2021  21.4
# 2      UK  2020   2.6
# 3      UK  2021   2.7

# Convert to wide format using pivot
gdp_wide = gdp_long.pivot(
    index='Country',  # Rows
    columns='Year',   # Columns
    values='GDP'      # Values to fill
)
print("\nWide Format:")
print(gdp_wide)
# Year     2020  2021
# Country
# UK        2.6   2.7
# USA      20.9  21.4

# More complex pivot with multiple values and aggregations.
# Add trade balance data:
gdp_trade = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'UK', 'UK', 'UK'],
    'Year': [2020, 2020, 2021, 2020, 2020, 2021],
    'GDP': [20.9, 20.9, 21.4, 2.6, 2.6, 2.7],
    'Trade_Balance': [-0.5, -0.6, -0.7, 0.2, 0.3, 0.1]
})

# Pivot with aggregation functions
gdp_summary = pd.pivot_table(
    gdp_trade,
    index='Country',
    columns='Year',
    values=['GDP', 'Trade_Balance'],
    aggfunc={
        'GDP': 'mean',                    # Average if multiple values
        'Trade_Balance': ['mean', 'sum']  # Multiple aggregations
    }
)
print("\nComplex Pivot Result:")
print(gdp_summary)
# The result has hierarchical columns combining the variable,
# the aggregation function, and the year
```

The pivot operations are particularly useful when:

- Preparing data for time series analysis (wide format often preferred)
- Creating summary tables for research papers
- Computing year-over-year changes
- Comparing multiple metrics across countries or time periods
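Going in the other direction, `pd.melt()` (or the `.melt()` method) converts wide data back to long format. A minimal sketch using the `gdp_wide` table created above:

```python
# Wide back to long: reset the index, then melt the year columns into rows
gdp_long_again = gdp_wide.reset_index().melt(
    id_vars='Country',  # Identifier column(s) to keep
    var_name='Year',    # Name for the former column labels
    value_name='GDP'    # Name for the values
)
print(gdp_long_again)
```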
## Sorting

Sorting data is a fundamental operation in data analysis. pandas provides flexible methods to sort DataFrames by one or multiple columns, or by index values. This is particularly useful when working with economic time series or when preparing data for presentation.

### Basic Sorting

```python
# Sort by a single column
df_sorted = df.sort_values("GDP", ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(["Country", "Year"])

# Sort by index
df_sorted = df.sort_index()
```

### Advanced Sorting Options

You can customize sorting behavior with additional parameters:

```python
# Sort by multiple columns in different directions
df_sorted = df.sort_values(
    ["Country", "GDP"],
    ascending=[True, False]  # Ascending by Country, descending by GDP
)

# Handle missing values (NaN)
df_sorted = df.sort_values(
    "GDP",
    na_position="first"  # Put NaN values at the beginning
)
```

Remember that sorting operations return a new DataFrame by default. To modify the original DataFrame in place, use the `inplace=True` parameter.

## Saving DataFrames

After manipulating your data, you'll often want to save the results for future use or sharing with colleagues. pandas provides several methods to export DataFrames to various file formats, each suited to different needs.

### Basic File Formats

For small to medium datasets, standard formats work well:

```python
# Save to CSV (most common)
df.to_csv("output.csv")

# Save to Excel
df.to_excel("output.xlsx", sheet_name="Sheet1")

# Save to JSON
df.to_json("output.json")
```

### Advanced Export Options

You can customize the export process with various parameters:

```python
# CSV with specific encoding and date format
df.to_csv(
    "output.csv",
    encoding="utf-8",
    date_format="%Y-%m-%d",
    index=False  # Don't save the index as a column
)

# Excel with multiple sheets
with pd.ExcelWriter("output.xlsx") as writer:
    df1.to_excel(
        writer,
        sheet_name="GDP Data"
    )
    df2.to_excel(
        writer,
        sheet_name="Population Data"
    )

# Compressed CSV for large datasets
df.to_csv("output.csv.gz", compression="gzip")   # gzip compression
df.to_csv("output.csv.bz2", compression="bz2")   # bzip2 compression
df.to_csv("output.csv.zip", compression="zip")   # zip compression

# Parquet format - excellent for large datasets
df.to_parquet(
    "output.parquet",
    compression="snappy"  # Fast compression, good for analytics
)

# Parquet with partitioning (e.g., by year and country)
df.to_parquet(
    "data_directory",
    partition_cols=["year", "country"]  # Creates nested directories
)
```

Remember to consider file size, readability, and compatibility when choosing an export format. CSV files are widely compatible but may lose data type information, while Excel files preserve formatting but may have size limitations.

## Further Resources

For more advanced features and detailed documentation:

1. [pandas documentation](https://pandas.pydata.org/docs/)
2. [pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)
3. [pandas API reference](https://pandas.pydata.org/docs/reference/index.html)

These resources provide comprehensive coverage of pandas' capabilities, including advanced features not covered in this introduction.

## Concluding Remarks

pandas is a cornerstone for data handling tasks in Python, especially when working with mixed-type, row-and-column datasets frequently encountered in economics. By combining the power of **pandas** with **NumPy**, you can:

1. **Load and Inspect**: Rapidly read data from files or databases and get immediate insights into your dataset's shape, column names, and data types.
2. **Filter and Subset**: Flexibly select rows and columns using integer-based, label-based, or Boolean indexing, making it easy to isolate subsets for detailed analysis or cleaning.
3. **Handle Missing Data**: Identify, drop, or fill missing values in a systematic way, crucial for preparing data for econometric models or machine learning pipelines.
4. **Aggregate**: Compute descriptive statistics across the entire dataset or by groups. In economics, this is especially valuable for multi-country, multi-industry, or multi-year analyses.
5. **Categorical and Continuous Data**: Distinguish between data types for appropriate transformations and quick statistical summaries.

With these skills, you can confidently tackle most routine data management and exploratory tasks in Python. As your projects scale up or require more sophisticated operations, pandas provides additional features (merging, pivoting, time-series functionality, and more) that integrate seamlessly with the broader Python ecosystem for data analysis, econometrics, and machine learning.